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Abstract; 

Over the past 30 years most climate models have grown from relatively simple representations of 
a few atmospheric processes to a complex multi-disciplinary system. Computer infrastructure 
over that period has gone from punch card mainframes to modem parallel clusters. Model 
implementations have become complex, brittle, and increasingly difficult to extend and maintain. 
Existing verification processes for model implementations rely almost exclusively upon some 
combination of detailed analysis of output from full climate simulations and system-level 
regression tests. In additional to being quite costly in terms of developer time and computing 
resources, these testing methodologies are limited in terms of the types of defects that can be 
detected, isolated and diagnosed. Mitigating these weaknesses of coarse-grained testing with 
finer-grained "unit" tests has been perceived as cumbersome and counter-productive. In the 
commercial software sector, recent advances in tools and methodology have led to a renaissance 
for systematic fine-grained testing. We discuss the availability of analogous tools for scientific 
software and examine benefits that similar testing methodologies could bring to climate 
modeling software. We describe the unique challenges faced when testing complex numerical 
algorithms and suggest techniques to minimize and/or eliminate the difficulties. 

Introduction 

Testing, verification, evaluation and validation are vital aspects of the construction of climate 
models. However, these processes are ingrained into the cultures of modeling centers and are 
often not specifically recognized. [11 Here, we distinguish validation and verification as in Post 
and Kendall f2 ’ where validation refers to comparison with observations and verification refers to 
comparison with analytic test cases and computational products. Testing and evaluation are more 
generic and will be defined more concretely below. 

From a software-focused perspective, our community is dominated by software developed by 
scientists as opposed to software engineers. f ' s The community invests heavily in system-level 
testing which exercises most portions of the underlying code base with realistic inputs. These 
system-level tests verify that multiple configurations of the model run stably for modest time 
intervals as well as certain key invariants such as parallel reproducibility and checkpoint restart. 
Contrariwise, the routine application of fine-grained, low-level tests (unit tests) is nearly 
nonexistent. Notable exceptions are limited to "infrastructure" software such as that of the Earth 
System Modeling Framework (ESMF). Infrastructure is typified by its role in supporting data 
organization and movement such as for parallelism, I/O, and transferring information from, for 
example, ocean grid to atmosphere grid. f4 ' 


The lack of use of unit tests in climate model development is in stark contrast to the practices of 
commercially developed software. Many reasons can be posited for this situation, but we argue 



that the fundamental reasons are a lack of awareness of the benefits of pervasive unit testing as 
well as strong misconceptions about the nature and difficulty of implementing unit tests. We 
argue here that analysis of the model evaluation process reveals the elements of the ingrained 
culture of testing and evaluation. This allows the definition of a more comprehensive approach to 
software testing. Among the benefits of a more formal approach to testing are improved software 
quality, reliability and productivity of model developers. 

Climate Model Practice 

In simplistic terms the process of model development can be conceived as a 2-phase cycle that 
alternates between development and evaluation. During development the programmer extends 
the capabilities of the model by producing new code. Then, during evaluation the programmer 
seeks to verify whether those changes achieve the desired results without breaking other aspects 
of the implementation. Different scales of code changes warrant different types of verification, 
and all successful verifications are tentative as later verifications may reveal flaws. 

A key observation about the above process is that developers naturally scale the magnitude of the 
verification phase to match computational resources and time required for the process. For 
example, if the verification technique requires lengthy simulations, then developers are unlikely 
to perform verification upon each new line of source code. Rather, they are likely to spend one or 
more days programming prior to submitting a verification run. Likewise, if the verification 
process is merely to ensure no compilation errors, developers may perform the verification 
frequently throughout the day. 

Our focus is ultimately on model evaluation processes that are not comparisons with 
observations, i.e. validation. However, it is useful to disentangle testing, verification and 
validation more thoroughly. Validation is a controversial issue in climate modeling centers. 
Indeed, Oreskes at al. |5] argued that formal validation of geophysical models of complex natural 
systems is "impossible." More recent papers studying the philosophy and social science of the 
culture of climate and weather modeling suggest that the extensive comparison with observations 
and the extensive review of climate assessments do stand as validation. ■ 1 

Because of formal programmatic requirements the Data Assimilation Office (DAO now the 
Global Modeling and Assimilation Office) at NASA’s Goddard Space Flight Center was required 
to produce a formal validation plan. m In the DAO validation plan the strategy was taken to 
define a quantitative baseline of model performance for a set of geophysical phenomena. These 
phenomena were broadly studied and simulated well enough that they described a credibility 
threshold for system performance. Important aspects of this validation approach were that it is 
defined by a specific application suite, formally separated validation from development, and 
relied on both quantitative and qualitative analysis. 

In this paper we are focused on the testing and evaluation that are specific to the software itself. 
That is, testing and evaluation that should be completed prior to the submission of the system for 
validation. Numerous studies have demonstrated that the cost of fixing a software defect is 
substantially less if detected early. 8 A primary verification mechanism is a short-duration run of 
the full modeling system. Even when such tests can be performed in a matter of minutes, they are 



often too crude to detect all but the most basic sorts of errors. Defects, therefore, fester until 
more thorough runs are performed and detailed analyses of the results are examined. When 
possible, finer-grained tests which directly verily the specific changes being attempted not only 
execute faster, but also have a higher probability of detecting undesired behavior earlier. 
Professional software engineers have recognized these characteristics of software development 
and have generally moved towards development processes which provide very rapid cycling 
between coding and verification. 

Software Testing 

Software testing can be applied at various levels within a climate model. Coarser-grained tests, 
e.g. those that encompass large portions of the implementation, can be useful for detecting the 
existence of defects, but are generally limited in their ability to isolate specific causes. In 
contrast, fine-grained tests are better at isolating and diagnosing specific defects, but the number 
of tests necessary to cover an entire application may be prohibitively expensive. Coarse-grained 
tests of modem parallel climate models may also require substantial computational resources for 
their application, whereas with appropriate care, fine-grained tests can be implemented to require 
only minimal computational resources. In practice, coarse-grained tests are applied nightly or 
weekly, while fine-grained tests are exercised more frequently to support ongoing development. 
Projects must consider the relative costs and benefits of each category of testing when 
formulating a testing plan. Below we define a categorization of tests. 

System Tests 

The coarsest level of testing is system testing, in that individual tests exercise all or nearly all of 
the underlying implementation. Most system tests are in the form of regression tests - i.e. the 
tests aim to reveal errors that have been introduced to existing functionality. Regression tests 
commonly used in climate modeling seek to answer questions such as: 

• Do the primary model configurations run to completion? 

• Are the results independent of parallel decomposition? 

• Does the process of checkpoint/restart alter the results? 

• Has the computational performance significantly changed? 

Although the process of regression testing is similar to studying the consequences of scientific 
changes to the model, the distinction is important. A regression test is objective and can be 
automated. A scientific change involves tradeoffs on subtle qualities of a model and requires a 
variety of investigations into the model behavior. 

One of the impediments to system testing in most, if not all, climate models is the large number 
of configurations that could be tested. For example, models can be configured with and without 
components such as an ocean, atmospheric chemistry, and aerosols, each of which may have 
alternate implementations. The number of potentially supported combinations grows 
exponentially with the number of components and can limit the extent to which the system is 
thoroughly tested. 



Despite the difficulties induced by the proliferation of model configurations, system level testing 
is where climate models have the greatest investment and maturity. For example, GISS ModelE 
f101 is developed with continuous integration, which allows nightly automated tests which 
checkout the latest development branch and compile and execute -10 model configurations 
under a variety of parallel decompositions. Alerts are emailed if a configuration fails to compile 
or execute, or if the results are not strongly reproducible. Even when the test reports are ignored 
in real time, experience shows that these tests often bound the dates on which a defect was 
introduced and greatly reduce the effort to identify and fix the problem. Other climate teams 
typically have longer integration cycles, but can compensate by more intensive testing at those 
boundaries. 

Integration testing 

Integration testing is the testing of software aggregates and is intermediate between system 
testing and fine-grained unit testing. Although there is much to be said about integration testing, 
the distinction is of little consequence to the arguments in this paper. 

Unit testing 

Unit tests are the finest granularity of testing and are used to verify low level behaviors of an 
implementation. Generally a unit test exercises a single unit (function or subroutine) by 
comparing the output generated from a set of synthetic input values against the corresponding 
output values. Some special unit testing concerns that arise for climate modeling include 
parallelism, checkpoint/restart, and numerical computations. Automated fine-grained testing is 
quite limited in current climate models, with the few counter examples generally associated with 
infrastructure layers. 

Unit testing frameworks 

Until quite recently, software developers dreaded the process of creating unit tests. The level of 
effort for producing tests was perceived to be high and a distraction from producing "real" 
software. Two innovations have influenced this attitude. The first was the creation of language- 
specific unit testing frameworks, e.g. JUnit, which reduce the difficulty of creating and executing 
tests and reporting test results. The second innovation was the adoption of test-driven 
development [U1 in which the conventional software development process is reversed with 
developers creating tests prior to the software to be tested. The change in developer perception of 
the role of unit tests is perhaps best typified by this quotation from Michael Feathers: "The main 
thing that distinguishes legacy code from non-legacy code is tests, or rather a lack of tests." [12 ‘ 
The implication is that the existence of a sufficiently robust suite of tests completely alters the 
experience, productivity, and overall risk when working with a software system. 

Spurred on by the success of JUnit, unit testing frameworks have been created for most modem 
programming languages. There are at least three independent frameworks for Fortran, including 
pFUnit, which also supports testing parallel implementations based upon MPI. Although these 
frameworks vary from language to language, their basic architectures are quite similar. Each 
framework provides a suite of assertion routines, which can be used by the developer to express 



the expected state of output variables after a call to the routine being tested. When an assertion 
fails to hold, the framework logs the name of the test that fails along with the location of the 
assertion and any accompanying user-provided messages for that assertion. 

Test-Driven Development (TDD) 

Test-driven development (TDD) is a major element of agile development processes and consists 
of a short development cycle that alternates between development of tests and development of 
code which enables the tests to pass. First, the developer creates or extends a test which then fails 
because the necessary functionality has not yet been created. Then the developer produces code 
just sufficient to pass the test. Finally, the developer cleans up with an emphasis on removing 
any incidental redundancy. The process encourages progress in the form of rapid, incremental 
steps and is made practical through unit testing frameworks. 

To better understand the TDD workflow and its application in the context of numerical software, 
consider the process of developing a simple ID linear interpolation procedure. The first step is 
to create a test. Usually the first test is extremely simple and is primarily intended to specify the 
desired interface for the procedure to be implemented, e.g. 

assert (0 == interpolate (x=[ 1,2 ] , y=[0,0j , at_x=l) ) 

With this test in place, we begin to implement the interpolate procedure itself, but only so far as 
to enable successful compilation and execution of the test. A trivial implementation that returns 
zero will suffice. For the next test we choose data that have a simple linear relationship: 

assert(l == interpolate (x=[ 0 , 2 ] ,y=[ 0 , 2 ] , at_x=l ) ) 

This test will already compile and execute, but it will fail due to the overly simplistic 
implementation after the first test. We proceed by extending the implementation to pass both 
tests. Additional tests could then be to check that the behavior is correct when there are multiple 
intervals in the data, or when the interpolation is outside the domain, the data are degenerate, etc. 
After each test, the interpolate procedure is extended to ensure that all tests pass. For many it 
will be initially counterintuitive to proceed in such minute steps, but with practice these steps 
happen very quickly and lead to steady predictable development of complex features. 

The practitioners of TDD tout many benefits. Chief among these is the improvement in overall 
developer productivity despite the substantial increase in the total number of lines of code for a 
given piece of functionality. The improved productivity stems from the comprehensive test 
coverage which in turn leads to far fewer fixes to defects later. Tests under TDD are 
continuously exercised and can, therefore, serve as a form of maintainable documentation. 
Developers using TDD may also be more productive due to reduced stress and improved 
confidence that arise from the immediate feedback as functionality is extended. Finally, TDD 
leads to higher quality implementations. This somewhat surprising claim stems from the 
observation that software that is designed to be testable tends to be comprised of smaller 
procedures having shorter argument lists. 



Barriers to Testing 


In this section we explore the barriers, both technical and cultural, to more pervasive adoption of 
testing methodologies in climate models. 

Scientists often perceive testing as an attempt to execute a model, a component, or an algorithm 
with maximal fidelity. Their instinct is to use representative grid resolutions, realistic input 
values, etc. The art of software testing is to use synthetic input values and small grids that 
exercise specific lines of code to produce output values that can be verified by inspection. High- 
resolution grids serve no purpose other than to slow the execution of tests and require more 
memory. The use of realistic input values is appropriate only to avoid branches that are 
unsupported by the implementation, or when independent realistic output values are available. 
Working with synthetic inputs does require some care to ensure that important cases are covered. 
For example, frequent programming errors such as index permutations and wrong offsets can be 
detected by imposing a spatial dependence on test inputs. 

The largest technical barrier to unit testing in climate modeling software is the legacy nature of 
the code base. Large (> 1000 sloe) procedures that rely heavily on the use of global variables are 
difficult to characterize in the form of a unit test. A possible test might be to store representative 
input and output data of a procedure in an external file. Tests could then be developed which 
exercise the procedure with the stored data and notify if any output values have changed. Such 
tests would be of limited value other than for the purposes of refactoring. 

Teams that desire to enhance software testing mitigate the problem of legacy software by 
limiting modification of existing routines. Rather than "wedging" new code into a large 
procedure, they develop a new testable subroutine. In the legacy procedure they insert, only, a 
call to the new routine. Groups also look for opportunities to extract small testable routines from 
large legacy procedures. 

Beyond the burdens of the legacy code base, numerical algorithms present difficulties that are 
absent in many other categories of software. The most basic of these arise from numerical errors 
that originate from both truncation and round off of floating point operations. Such errors are 
problematic because the corresponding tests must not only specify expected values for output 
parameters, but also provide tolerances for acceptable departures from those expected values. If 
the tolerances are too tight, then tests will appear to fail despite producing acceptable results, 
whereas if tolerances are too loose, then tests lose value as they fail to detect some errors. In 
most cases, developers do not have suitable a priori values for such tolerances, even when the 
asymptotic form of the truncation error for an algorithm is known. 

Fortunately, the difficulty of specifying suitable tolerances is not as severe in practice as the 
above discussion indicates. First, note that the tests are verifying the implementation - not the 
asymptotic form of the algorithm. When an algorithm is broken down into sufficiently small 
steps, the resulting pieces are often amenable to simple, even trivial, error analysis. Complicated 
error bounds are largely a consequence of the how errors compound through subsequent stages 
of a calculation, and from the testing perspective, this compounding can be avoided by 
examining each step independently. 



An important concern of climate models is whether small changes to an implementation may 
result in a different basin of attraction for the trajectories and hence a different climate. The 
chaotic nature of the underlying dynamics effectively limits the ability of unit tests to constrain 
the system in this regard, and only long control runs can protect against undesirable changes. 
Likewise, so long as all “local” tests pass, long control runs cannot identify any particular aspect 
of the implementation that is responsible for an incorrect result, though the most recent change 
will often be the assumed offender. Over time, researchers identify additional constraints (and 
thus unit tests) that decrease the frequency of incorrectly altering the simulated climate, but 
verification will always require control runs. 

Another major objection to testing of some numerical algorithms is the general lack of known, 
analytic solutions for realistic algorithms that could be used to derive appropriate values to use in 
tests. Unless an equivalent and entirely independent algorithm is available, tests of such 
algorithms tend to be redundant with the implementation and thus of no real value. As with the 
discussion of numerical errors above, this concern can largely be eliminated by decomposing 
algorithms into distinct, simple steps, for which synthetic test values are readily apparent. For 
pedagogical purposes, consider how testing would be implemented for a procedure which 
computes the area of a circle. 

area = areaOf Circle ( radius ) 

One might choose a small set of trial values for the input radius: 0, 1, 2 and verify that the results 
are 0., 3.14159265, 12.5663706 respectively. These results are not obvious by inspection because 
mental arithmetic with n is problematic. One could replace these literal values with expressions 
involving n, but that the resulting test is very nearly redundant with the algorithm being tested. 
However, from the software engineering perspective, there is a different option: introducing it as 
a second parameter into the interface. We can then use simple values of "jt" such as 1 or 2 for 
testing the implementation of the procedure. One would implement the usual interface for 
areaOf Circ le ( radius ) by passing a hardcoded value of k to the procedure that accepts 
two arguments. This hardcoded value of rt can be tested by inspection or by trigonometric 
identities. 

This methodology for decomposing numerical schemes may seem overly burdensome at first; 
however, we argue that compared to the level of effort of developing a scientific 
parameterization, the effort spent on expressing algorithms in this fine-grained manner would be 
modest, and the return might be immeasurable in the form of early detection of software defects. 
This approach also helps to side step the issue of providing realistic input values for complicated 
physical parameterizations based upon empirical parameters and curve fitting. Of course, if 
realistic input and output data are available, then corresponding tests should be created. More 
often, though, high-level tests for such parameterizations should verify the sequence of sub-steps 
rather than attempting to compose a full numerical comparison. This approach is more natural 
with abstract data types and object-oriented programming, and may require re-engineering a 
relatively small number of interfaces in a model. 

More broadly, the technique of splitting scientific models into very small procedures to enable 
unit testing raises concerns about computational efficiency. If algorithms are split into very small 



procedures with short parameter lists, then the overhead cost in terms of performance in a real 
code could be substantial. This difficulty can be overcome via bootstrapping. Namely, an 
optimized implementation (large monolithic optimized procedure) can be tested against the 
slower fine-grained implementation that is in turn covered by simple, clean unit-tests. Although 
this technique introduces yet more overhead to the development process, the fine-grained 
implementation can itself be a very useful reference for attempts at further optimizing an 
algorithm. 

Applications of Test-Driven Development 

In the Software Integration and Visualization Office (SIVO) at the NASA Goddard Space Flight 
Center, we have aggressively applied TDD to several development projects including two that 
are directly relevant to scientific modeling. The first of these was to develop a parallel 
application, GTRAJ, which calculates the trajectories of billions of air parcels in the atmosphere. 
The other was to develop a parallel implementation, SNOWFAKE, of a numerical model fi3 ‘ that 
simulates the growth of virtual snowflakes. The success of these projects demonstrates that unit 
testing in general, and TDD in particular, can be applied in the modeling arena. These projects 
also revealed the challenges of repressing established programming habits and methodology on 
the part of senior scientific programmers. 

GTRAJ is a complete rewrite of a legacy application, the Goddard Trajectory Model [ ' 4] 
originally developed in IDL. To achieve portability and scalability, SIVO applied TDD to 
rewrite the application in C++ and MPI using a team of senior scientific programmers that were 
relatively inexperienced with unit testing and TDD. The primary numerical algorithms used in 
GTRAJ are numerical interpolation of gridded meteorological data, and temporal integration of 
the interpolated flow field experienced by a given parcel of air. The original model was primarily 
tested by measuring the discrepancy in time reversal of a five-day integration. Although 
seemingly stringent, the requirement of time-reversibility is unable to detect many types of 
errors. During the rewrite the new fine-grained tests revealed two important errors in the original. 
First, the binning algorithm which selects which grid points should be used for interpolation was 
incorrect in some regions with sharp changes in the pressure heights. The "interpolations" would 
then become extrapolations and occasionally extreme inaccuracies would result. The correction 
would have been difficult in the original monolithic implementation, but was straightforward in 
the version developed under TDD. The second error detected by unit testing in the development 
of GTRAJ was the behavior of Runge-Kutta (RK) when parcels pass near the poles in a latitude- 
longitude grid. Examination of the large errors for such trajectories using synthetic test data, led 
to the realization that the usual RK algorithm must be reformulated for curvilinear coordinate 
systems. Wi th this correction, the accuracy of trajectories that pass near the poles was improved 
by orders of magnitude which in turn permitted substantially larger time steps and faster 
execution of the application. 

At several points in the creation of GTRAJ, the primary developers were challenged by the 
process of following TDD. The temptation to immediately begin implementing an algorithm was 
difficult to resist, especially early in the project. Unfortunately, the pressure to skip tests is 
typically highest when the tests do the most good either because the design requires further 
thought, or the expected behavior is harder to specify. Scientific programmers also struggled 



with a tendency to use realistic input values in tests rather than "enlightening" values, but rapidly 
improved. 

In contrast to GTRAJ, the initial serial implementation of SNOWFAKLE was developed by a 
single developer who was experienced in using TDD for components of model infrastructure. 
The project benefited from the precise mathematical specification of the model in the reference 
paper which was translated into a set of basic tests. The initial implementation was completed in 
approximately 12 hours of development time and was comprised of approximately 700 lines of 
source code and 800 lines of tests. The application ran to completion on the first try and 
produced simulated snowflakes in excellent agreement with the reference paper. The simple 
"micro" debugging at each stage of TDD apparently eliminated the substantial debugging 
typically experienced during the integration phase of programs of even less complexity. Other 
encouraging statistics are that procedures averaged under 12 lines and less than 2 parameters 
(arguments). As one might expect, the initial implementation was suboptimal in terms of 
computational performance, but optimized variants of key algorithms were then easily created 
and verified by tests that compared optimized results against the slower baseline 
implementations. For example, one significant cache-based optimization was from fusing the 
sub steps associated with vapor diffusion. The test for the fused procedure was simply to 
compare the results of the fused procedure with a sequential execution of each original sub step, 
Both implementations and corresponding tests were retained. 

Conclusions 

Confidence in the scientific predictions of climate models is contingent upon confidence in the 
underlying implementations of those models. In this paper we have advocated that current 
software development practices are inadequate for gaining such trust, and that the increased use 
of automation and systematic fine-grained testing of model implementations would be a major 
improvement in this regard. Further, such testing has been shown to provide a net improvement 
to developer productivity in other communities through a number of indirect mechanisms. 

Our experience with two applications suggests that this testing methodology is applicable to 
climate modeling, but more relevant examples are needed. Changing established processes in 
any community is difficult, and scientists are understandably skeptical about the costs and 
benefits of our approach. In the near future we intend to apply test-driven development in the 
creation of a parameterized model component from scratch. The early phases of the 
development of such a component are the ideal time to express the various physical requirements 
and constraints as unit-tests, and the results should serve as a strong demonstration of the 
potential of this technique. 

Many technical difficulties must be overcome before unit testing can become pervasive within 
climate models. Chief among these is the extreme difficulty of introducing fine-grained tests 
into the procedures typical of legacy science code. In many cases, this issue can by side- 
stepped by implementing changes and extensions to models as new modules. This is in contrast 
with the common practice of “wedging” changes directly into old procedures and ultimately 
compounding the legacy burden. Fixes to software bugs are a particularly important case to 
follow this approach, as the constructed test serves as protection against re-introduction of the 



defect. In the long tenn more powerful toots must he developed which allow developers to 
efficiently extract disjoint bits of functionality from the legacy layer. In other programming 
languages, especially Java, new generations of powerful software developer tools have been 
shown to be quite effective for such purposes. Photran tI5] is an attempt to introduce some of 
these capabilities for Fortran, but greater investments should be made to bring appropriate tools 
to a state of maturity suitable for routine application to climate models. 
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